15 research outputs found

RasBhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison

Author: Hahn Lars
Leimeister Chris-André
Lonardi Stefano
Morgenstern Burkhard
Ounit Rachid
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 20/07/2016

Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de
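The filtering idea in this abstract can be illustrated with a minimal sketch. It assumes a pattern is written as a string of `1` (match) and `0` (don't-care) characters; the function names `spaced_words` and `count_spaced_word_matches` are hypothetical and are not taken from rasbhari or Spaced Words.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """List the spaced words of seq for a binary pattern.

    Only the characters under the '1' (match) positions of the
    pattern are kept; '0' (don't-care) positions are skipped.
    """
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return [
        "".join(seq[start + i] for i in match_positions)
        for start in range(len(seq) - span + 1)
    ]


def count_spaced_word_matches(seq_a, seq_b, pattern):
    """Count position pairs (i, j) whose spaced words are identical."""
    counts_a = Counter(spaced_words(seq_a, pattern))
    counts_b = Counter(spaced_words(seq_b, pattern))
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    print(spaced_words("ACGTAC", "101"))
    print(count_spaced_word_matches("ACGTAC", "ATGTCC", "101"))
```

For a set of several patterns, the match count would be accumulated over all patterns in the set; the variance of that count is the quantity the abstract relates to overlap complexity.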

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Phylogeny reconstruction based on the length distribution of k-mismatch common substrings

Author: Burkhard Morgenstern
Chris-André Leimeister
Svenja Schöbel
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/12/2017

Background: Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between pairs of input sequences. Haubold et al. (J Comput Biol 16:1487–1500, 2009) showed how the average number of substitutions per position between two DNA sequences can be estimated based on the average length of exact common substrings. Results: In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position can be accurately estimated from the position of a local maximum in the length distribution of their k-mismatch common substrings.
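As a rough illustration of the quantity studied here, the sketch below tabulates a length distribution by brute force. It rests on two assumptions that the abstract does not spell out: a k-mismatch common substring starting at positions i and j is taken to extend until just before the (k+1)-th mismatch, and every position pair (i, j) contributes one length. The function names are hypothetical, and the quadratic loop is only suitable for short toy sequences.

```python
from collections import Counter


def k_mismatch_extension(a, b, i, j, k):
    """Length of the common extension from a[i:] and b[j:] with at most k mismatches."""
    mismatches = 0
    length = 0
    while i + length < len(a) and j + length < len(b):
        if a[i + length] != b[j + length]:
            if mismatches == k:
                break  # the next mismatch would be number k + 1
            mismatches += 1
        length += 1
    return length


def length_distribution(a, b, k):
    """Histogram of extension lengths over all start pairs (i, j)."""
    counts = Counter()
    for i in range(len(a)):
        for j in range(len(b)):
            counts[k_mismatch_extension(a, b, i, j, k)] += 1
    return counts


if __name__ == "__main__":
    print(sorted(length_distribution("ACGTACGT", "ACGAACGT", k=1).items()))
```

Locating a local maximum in such a histogram and converting its position into a substitution rate is the step described in the paper itself and is not reproduced here.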

Directory of Open Access Journals

rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison.

Author: Hahn Lars
Leimeister Chris-André
Lonardi Stefano
Morgenstern Burkhard
Ounit Rachid
Publication venue: eScholarship, University of California
Publication date: 01/10/2016

eScholarship - University of California

Pattern sets for short read classification.

Author: Burkhard Morgenstern (7977)
Chris-André Leimeister (3229197)
Lars Hahn (3229200)
Rachid Ounit (3229194)
Stefano Lonardi (76597)

Pattern sets used for short read classification: (A) as used by default in CLARK-S, (B) generated with rasbhari minimizing overlap complexity and (C) generated with rasbhari maximizing sensitivity.

FigShare

Sensitivity comparison of different programs.

Author: Burkhard Morgenstern (7977)
Chris-André Leimeister (3229197)
Lars Hahn (3229200)
Rachid Ounit (3229194)
Stefano Lonardi (76597)

Sensitivity comparison of different programs.

FigShare

Read classification with CLARK-S using different pattern sets.

Author: Burkhard Morgenstern (7977)
Chris-André Leimeister (3229197)
Lars Hahn (3229200)
Rachid Ounit (3229194)
Stefano Lonardi (76597)

Read classification with CLARK-S using different pattern sets.

FigShare

Overlap complexity of pattern sets in the hill-climbing algorithm.

Author: Burkhard Morgenstern (7977)
Chris-André Leimeister (3229197)
Lars Hahn (3229200)
Rachid Ounit (3229194)
Stefano Lonardi (76597)

Normalized overlap complexity (OC) of pattern sets depending on the number of iteration steps in our algorithm. The first two plots show how the OC is reduced in a single round of the hill-climbing algorithm for different parameters. For a set of m = 10 patterns of length ℓ = 14 and weight w = 8, the algorithm converges after around 3,000 iteration steps of hill-climbing (upper plot); for a set of m = 20 patterns of length ℓ = 44 and weight w = 14, it converges after around 80,000 steps (middle plot). The lower plot shows how the OC is improved if the hill-climbing algorithm is run multiple times and the best result of all runs is returned.
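The behaviour described in this caption can be approximated with a plain hill-climbing sketch. It is not rasbhari's modified algorithm. It assumes one common formulation of overlap complexity (for each pair of patterns, including a pattern with itself, sum 2 to the power of the number of aligned match positions over all shifts), keeps the first and last position of every pattern as match positions, and does not normalize the score. All function names are hypothetical.

```python
import random


def pair_overlap(p, q):
    """Sum over all shifts of 2 ** (number of '1' positions of q aligned with '1' in p)."""
    total = 0
    for shift in range(1 - len(q), len(p)):
        shared = sum(
            1
            for i, c in enumerate(q)
            if c == "1" and 0 <= i + shift < len(p) and p[i + shift] == "1"
        )
        total += 2 ** shared
    return total


def overlap_complexity(patterns):
    """Unnormalized overlap complexity over all pairs r <= r' (assumed definition)."""
    m = len(patterns)
    return sum(pair_overlap(patterns[r], patterns[s]) for r in range(m) for s in range(r, m))


def random_pattern(length, weight, rng):
    """Random pattern with fixed match positions at both ends (requires weight >= 2)."""
    inner = set(rng.sample(range(1, length - 1), weight - 2))
    return "".join("1" if i in (0, length - 1) or i in inner else "0" for i in range(length))


def hill_climb(m, length, weight, steps, seed=0):
    """Swap one inner '1' with one '0' per step; keep the swap only if OC drops."""
    rng = random.Random(seed)
    patterns = [random_pattern(length, weight, rng) for _ in range(m)]
    best = overlap_complexity(patterns)
    for _ in range(steps):
        k = rng.randrange(m)
        p = list(patterns[k])
        ones = [i for i in range(1, length - 1) if p[i] == "1"]
        zeros = [i for i in range(1, length - 1) if p[i] == "0"]
        if not ones or not zeros:
            break  # no legal swap exists for these parameters
        i, j = rng.choice(ones), rng.choice(zeros)
        p[i], p[j] = "0", "1"
        candidate = patterns[:k] + ["".join(p)] + patterns[k + 1:]
        score = overlap_complexity(candidate)
        if score < best:
            patterns, best = candidate, score
    return patterns, best


if __name__ == "__main__":
    result, oc = hill_climb(m=4, length=14, weight=8, steps=500, seed=1)
    print(oc, result)
```

Calling `hill_climb` with several different seeds and keeping the lowest score corresponds to the multiple-run strategy mentioned for the lower plot.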

FigShare

Homologue and background contribution to the variance of the number N of spaced-word matches.

Author: Burkhard Morgenstern (7977)
Chris-André Leimeister (3229197)
Lars Hahn (3229200)
Rachid Ounit (3229194)
Stefano Lonardi (76597)

Contribution of the homologue and background variance to the total variance of the number N of spaced-word matches in eq (4) (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1005107#pcbi.1005107.e032) for different match probabilities p and sequence lengths L.

FigShare